In this project I explored how the quality of the wines in a white wine dataset is influenced by their chemical measurements. I looked for the features with the highest impact on wine quality and tried to find a linear model that can predict quality from a set of wine features.
To begin, I looked at a summary of the dataset features.
## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ score : Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality score
## Min. : 8.00 Min. :3.000 6 :2198
## 1st Qu.: 9.50 1st Qu.:5.000 5 :1457
## Median :10.40 Median :6.000 7 : 880
## Mean :10.51 Mean :5.878 8 : 175
## 3rd Qu.:11.40 3rd Qu.:6.000 4 : 163
## Max. :14.20 Max. :9.000 3 : 20
## (Other): 5
I noticed that all the variables are numeric and that the quality variable could be transformed into an ordered factor, so I added a new variable called “score” (this code has been moved to the beginning of the file) so that each observation has both a quality and a score value. The summary function gave me a sense of the variable distributions, but I am going to explore all the variables by plotting their distributions:
## Warning: position_stack requires constant width: output may be incorrect
The original data set has 4898 observations of 13 variables (the 14th variable in the summary above is the derived “score”):
$ X : int -> Progressive number
$ fixed.acidity : num 3.8 - 14.2
$ volatile.acidity : num 0.08 - 1.1
$ citric.acid : num 0.00 - 1.66
$ residual.sugar : num 0.6 - 65.8
$ chlorides : num 0.009 - 0.346
$ free.sulfur.dioxide : num 2.0 - 289.0
$ total.sulfur.dioxide: num 9.0 - 440.0
$ density : num 0.987 - 1.039
$ pH : num 2.72 - 3.82
$ sulphates : num 0.22 - 1.08
$ alcohol : num 8.0 - 14.2
$ quality : int 3 - 9
Alcohol is more widely spread than the other variables, with an almost linear distribution between 9.9 and 12.
Quality is an integer, but it can be treated as an ordered factor, so I created the “score” ordered factor from the quality values.
The main features I was interested in were quality and alcohol.
All the features are interesting. I expect that wines with low acidity, chlorides and sulphates will score better than other wines.
Yes, I created a “score” variable, which is an ordered factor built from the “quality” variable. Each of the observed quality values (integers from 3 to 9) was mapped to a factor level. I kept both quality (integer) and score (ordered factor) because the integer can be used to calculate correlations, while the ordered factor can be used to separate observations into groups.
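A minimal sketch of how such a “score” column can be derived, using a small toy data frame in place of the real `wines` data. The nine levels mirror the `Ord.factor w/ 9 levels` line in the summary; the toy values themselves are made up for illustration.

```r
# Toy stand-in for the wines data frame; only the quality column matters here.
wines <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L))

# Keep the integer column and add an ordered factor next to it.
wines$score <- factor(wines$quality, levels = 1:9, ordered = TRUE)

str(wines)

# The integer version supports arithmetic such as means and correlations.
mean(wines$quality)

# The ordered factor supports grouping and order comparisons between levels.
table(wines$score >= "7")
```

Keeping both columns avoids converting back and forth: `quality` goes into `cor()` and `lm()`, while `score` is convenient for colouring and faceting plots.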
The alcohol distribution was unusual: it was not Gaussian. The dataset was already in tidy format, so I did not have to make adjustments. As described above, I transformed the “quality” integer variable into an ordered factor called “score”.
I started by exploring the ggpairs matrix on a sample of 1000 observations from the dataset. I renamed the dataset features to make the plot more readable, but I was unable to suppress the warnings or to resize the correlation font. I printed the correlation summary after the plot.
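The sampling step can be sketched as follows. The data frame here is synthetic and the seed value is an arbitrary choice, used only to make the sample reproducible; it is not the seed of the original analysis.

```r
# Synthetic stand-in with the same number of rows as the wines data set.
set.seed(42)  # hypothetical seed, only for reproducibility
wines <- data.frame(id = 1:4898, alcohol = runif(4898, 8, 14.2))

# Draw 1000 distinct rows without replacement.
wines_sample <- wines[sample(nrow(wines), 1000), ]
nrow(wines_sample)

# Note: in an R Markdown chunk header, the knitr options
# `warning = FALSE` and `message = FALSE` hide warnings and messages
# (such as the stat_bin notes) from the knitted output.
```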
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Since the correlations are unreadable on the ggpairs plot, here is a more readable version:
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
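A matrix like the one above can be made easier to scan by rounding it and by ranking the features by the absolute value of their correlation with quality. The sketch below uses a synthetic three-column data frame; the coefficients used to generate it are assumptions, chosen only so that the signs resemble those in the real matrix.

```r
# Synthetic data: density falls with alcohol, quality rises with alcohol.
set.seed(1)
n <- 500
alcohol <- runif(n, 8, 14)
density <- 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001)
quality <- round(2.5 + 0.3 * alcohol + rnorm(n, sd = 0.8))
toy <- data.frame(alcohol, density, quality)

# Full Pearson correlation matrix, rounded for readability.
cors <- cor(toy)
round(cors, 3)

# Correlations with quality only, strongest first (by absolute value).
q <- cors[, "quality"]
q <- q[names(q) != "quality"]
q[order(abs(q), decreasing = TRUE)]
```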
===============================================================
There are too many variables in the above ggpairs plot. In the following one I selected the most interesting variables from the previous ggpairs plot.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
=======================================================================
There is a high correlation between density and residual sugar (0.839), so I wanted to see this scatter plot in detail. I filtered the results to exclude outliers and added an alpha value to avoid overplotting.
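One way to do this kind of filtering is to cut each variable at a high quantile and to draw semi-transparent points. The sketch below uses base graphics and synthetic data; the 99% cut-off and the alpha value of 0.2 are assumptions, not the values used for the plot in this report.

```r
# Synthetic data with one extreme residual sugar value.
set.seed(2)
n <- 1000
residual.sugar <- c(rexp(n - 1, rate = 1 / 6), 65.8)
density <- 0.991 + 0.0004 * residual.sugar + rnorm(n, sd = 0.001)
toy <- data.frame(residual.sugar, density)

# Drop the top 1% of each variable (assumed cut-off).
keep <- toy$residual.sugar < quantile(toy$residual.sugar, 0.99) &
  toy$density < quantile(toy$density, 0.99)
trimmed <- toy[keep, ]

# Semi-transparent points: dense regions appear darker instead of
# collapsing into a solid blob.
plot(trimmed$residual.sugar, trimmed$density,
     pch = 16, col = rgb(0, 0, 0, alpha = 0.2),
     xlab = "residual sugar", ylab = "density")
```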
==================================================
There is also an interesting correlation (-0.78) between density and alcohol.
===================================================
There is a smaller but still notable correlation between alcohol and residual sugar (-0.451).
In this plot the majority of the points are in the lower-left corner, while there are very few in the upper-right corner. There is also a high concentration of points at very low residual sugar values. This may be because winemakers tend to let fermentation run as long as possible, transforming all the sugar into alcohol.
====================================================
Alcohol has the strongest correlation with quality (0.436), followed by density (-0.307) and chlorides (-0.210) while other variables have a lower impact on the quality.
Quality is most correlated with alcohol (0.436), density (-0.307), chlorides (-0.21), volatile acidity (-0.195) and total sulfur dioxide (-0.175), although none of these correlations is strong in absolute terms.
There is an evident correlation between density and residual sugar (0.839). This is consistent with the fermentation process, which transforms sugar (dense) into alcohol (less dense). The negative correlation between residual sugar and alcohol (-0.451) points the same way.
The strongest relationship I found is between density and residual sugar (0.839). This relationship can be explained by the conversion of sugar into alcohol during winemaking.
I also found another strong relationship between alcohol and wine quality (0.436).
We are most interested in wine quality, so I will use this variable to color the output and search for patterns.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
========================================================
Let’s have a closer look at alcohol by density and alcohol by residual sugar, colored by quality score.
In the following plots I omit the median quality value (6) to avoid overplotting and to get a better distinction between the good and bad wines.
As alcohol increases, we get more high-quality wines in both plots. In the first one we can also see that, as the alcohol concentration increases, the density decreases. In the second one we can see that highly alcoholic wines have less residual sugar.
=============================================================
In the following plot I’ll avoid plotting the median score value (6).
We can notice that better wines have higher residual sugar for the same density values.
===================================================================
Low density and low volatile acidity both have an impact on wine quality, but there is no particular pattern linking the two factors.
==========================================================
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Wines with score 5 or lower are more concentrated on lower alcohol percentage.
===================================================
I created a linear model to see if we can predict quality based on the main correlated features.
##
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = wines)
## m2: lm(formula = quality ~ alcohol + density, data = wines)
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = wines)
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity,
## data = wines)
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides, data = wines)
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity +
## chlorides + total.sulfur.dioxide, data = wines)
##
## =======================================================================================
## m1 m2 m3 m4 m5 m6
## ---------------------------------------------------------------------------------------
## (Intercept) 2.582*** -22.492*** 90.313*** 74.225*** 73.271*** 81.344***
## (0.098) (6.165) (12.374) (11.977) (11.999) (12.246)
## alcohol 0.313*** 0.360*** 0.246*** 0.286*** 0.283*** 0.284***
## (0.009) (0.015) (0.018) (0.018) (0.018) (0.018)
## density 24.728*** -87.886*** -71.546*** -70.514*** -78.777***
## (6.079) (12.317) (11.923) (11.949) (12.209)
## residual.sugar 0.053*** 0.052*** 0.052*** 0.053***
## (0.005) (0.005) (0.005) (0.005)
## volatile.acidity -2.059*** -2.044*** -2.077***
## (0.109) (0.110) (0.110)
## chlorides -0.692 -0.769
## (0.540) (0.540)
## total.sulfur.dioxide 0.001**
## (0.000)
## ---------------------------------------------------------------------------------------
## R-squared 0.190 0.192 0.210 0.264 0.264 0.266
## adj. R-squared 0.190 0.192 0.210 0.263 0.263 0.265
## sigma 0.797 0.796 0.787 0.760 0.760 0.759
## F 1146.395 583.290 434.085 438.646 351.293 295.042
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5831.127 -5776.812 -5604.126 -5603.301 -5598.094
## Deviance 3112.257 3101.773 3033.737 2827.187 2826.235 2820.233
## AIC 11684.782 11670.255 11563.624 11220.251 11220.603 11212.189
## BIC 11704.272 11696.241 11596.107 11259.231 11266.079 11264.161
## N 4898 4898 4898 4898 4898 4898
## =======================================================================================
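Most features slightly increase the accuracy of the model, but the overall result is not satisfactory: an R-squared of 0.266 is very low. Chlorides is the exception: its coefficient is not significant, and adding it (m5) leaves R-squared at 0.264 while AIC rises slightly (from 11220.251 to 11220.603).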
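The same kind of stepwise comparison can be sketched with nested `lm` fits. The data below are synthetic, with one feature that is unrelated to quality by construction, so the variable names and effect sizes are assumptions and the numbers will not match the table above.

```r
# Synthetic data: quality depends on alcohol and volatile acidity only.
set.seed(3)
n <- 2000
alcohol <- runif(n, 8, 14)
volatile.acidity <- runif(n, 0.1, 0.6)
noise.feature <- rnorm(n)  # hypothetical feature with no real effect
quality <- 3 + 0.3 * alcohol - 2 * volatile.acidity + rnorm(n, sd = 0.8)
toy <- data.frame(alcohol, volatile.acidity, noise.feature, quality)

# Build nested models by adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + noise.feature)

# Compare adjusted R-squared and AIC side by side.
data.frame(
  model = c("m1", "m2", "m3"),
  adj.r2 = sapply(list(m1, m2, m3), function(m) summary(m)$adj.r.squared),
  AIC = AIC(m1, m2, m3)$AIC
)

# F-test for whether the extra predictor improves the fit.
anova(m2, m3)
```

Adjusted R-squared and AIC both penalise extra predictors, so a feature that adds nothing should leave them roughly unchanged or slightly worse.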
There is a good correlation between density, residual sugar and alcohol.
##
## Calls:
## m10: lm(formula = (density ~ residual.sugar), data = wines)
## m11: lm(formula = density ~ residual.sugar + alcohol, data = wines)
##
## =====================================
## m10 m11
## -------------------------------------
## (Intercept) 0.991*** 1.005***
## (0.000) (0.000)
## residual.sugar 0.000*** 0.000***
## (0.000) (0.000)
## alcohol -0.001***
## (0.000)
## -------------------------------------
## R-squared 0.704 0.907
## adj. R-squared 0.704 0.907
## sigma 0.002 0.001
## F 11636.984 23791.076
## p 0.000 0.000
## Log-likelihood 24498.873 27328.019
## Deviance 0.013 0.004
## AIC -48991.747 -54648.037
## BIC -48972.257 -54622.051
## N 4898 4898
## =====================================
In fact, this model is much better. Alcohol concentration and residual sugar are the main factors in determining the density.
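The coefficients shown as 0.000 in the table are a rounding effect: density varies over a very narrow range, so the slopes are small in absolute terms. A short sketch on synthetic data shows how to read them at full precision. The generating coefficients (0.0004 and -0.001) are assumptions picked to be on a similar scale, not estimates from the wines data.

```r
# Synthetic data on a scale similar to the wines variables.
set.seed(4)
n <- 1000
residual.sugar <- runif(n, 0.6, 20)
alcohol <- runif(n, 8, 14)
density <- 1.005 + 0.0004 * residual.sugar - 0.001 * alcohol +
  rnorm(n, sd = 0.0005)
toy <- data.frame(residual.sugar, alcohol, density)

m <- lm(density ~ residual.sugar + alcohol, data = toy)

round(coef(m), 3)    # three decimals hide the residual sugar slope
signif(coef(m), 3)   # three significant digits keep it visible
summary(m)$r.squared
```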
Yes, in general wines with lower density tend to have higher quality (correlation -0.307), while residual sugar does not seem to have a clear impact on the quality (correlation -0.097). Combining residual sugar and density, we can see that for a given density, wines with higher residual sugar have higher quality.
It was interesting to see how density is correlated with sugar and alcohol content. The longer the fermentation lasts, the lower the residual sugar and the higher the alcohol percentage. The final residual sugar and alcohol percentage are the main factors in the density measure.
I created two models.
The first one predicts the quality of the wine from the dataset features. This model was very weak, with an R-squared value of 0.266. It suggests that it is really hard to predict the quality of a wine from objective measurements of its chemical components.
The second model predicts the wine density from residual sugar and alcohol. This model was quite accurate, with an R-squared value of 0.907.
## [1] 0.9258881
The first plot shows the quality distribution of the wines in the dataset. The dataset contains wines that scored from 3 to 9, in a distribution close to binomial. There are very few wines scoring 3 or 9 quality points, while the vast majority of the wines (92.6 %) score 5, 6 or 7 points.
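The share of wines scoring 5, 6 or 7 can be checked from the per-score counts printed in the summary at the top of the report. The sketch below rebuilds the quality vector from those counts; it assumes that the 5 wines listed as “(Other)” all scored 9, which follows from the quality range of 3 to 9.

```r
# Counts per score taken from the summary output; "(Other)" is assumed
# to be score 9, the only remaining value in the 3-9 range.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)                 # total number of wines
share <- mean(quality %in% 5:7) # fraction scoring 5, 6 or 7
share
```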
## [1] 0.4355747
## [1] 0.4675664
## [1] -0.1321443
The exploratory analysis showed that alcohol percentage has an influence on wine quality (the correlation between alcohol and quality is 0.436). To illustrate this relation I created this box plot of the alcohol concentration for each quality score. Better wines (scoring 7 or above) tend to have a higher alcohol concentration. This almost linear relation between score and alcohol concentration only holds for scores from 5 to 9 (where the correlation between alcohol and score is 0.468); for scores below 5 the trend reverses (the correlation for scores from 3 to 5 is -0.132). Because of this reversal the relation is not monotonic: the same alcohol percentage can correspond to both low and high scores, which makes it difficult to predict the score from the alcohol percentage alone.
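Correlations restricted to a range of scores can be computed by subsetting before calling `cor()`. The sketch below uses synthetic data with a deliberately V-shaped relation (an assumption made to mimic the reversal described above), so the values differ from those reported for the wines.

```r
# Synthetic data: alcohol is lowest around quality 5 and rises on both sides.
set.seed(5)
quality <- sample(3:9, 3000, replace = TRUE)
alcohol <- 10 + 0.5 * abs(quality - 5) + rnorm(3000, sd = 1)
toy <- data.frame(quality, alcohol)

# Overall correlation, then the two halves of the score range.
cor_all  <- cor(toy$alcohol, toy$quality)
cor_high <- with(subset(toy, quality >= 5), cor(alcohol, quality))
cor_low  <- with(subset(toy, quality <= 5), cor(alcohol, quality))

c(all = cor_all, high = cor_high, low = cor_low)
```

A single overall coefficient blends the two opposite trends, which is why splitting the range gives a clearer picture than the global value.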
## Warning: Removed 5 rows containing missing values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing missing values (geom_path).
This scatter plot represents the relation between wine density and residual sugar, colored by wine quality. The regression line represents the linear relation between density and residual sugar. The plot shows that very good wines are concentrated on the side of the regression line with lower density and higher residual sugar. This agrees with the previous plot, because a wine needs a high percentage of alcohol to have both high residual sugar and low density.
The dataset was tidy and clean, so I could dig directly into the analysis. The ggpairs plot was very useful for spotting possible correlations between variables and gave me several insights. I struggled to find the ggpairs documentation and to format the plot for the knitted file.
Geographical position (including altitude above sea level) and year of production would be interesting to analyse. I think these features could be significant in determining wine quality, because altitude and weather can affect the amount of sugar before fermentation and so lead to a higher final alcohol volume and residual sugar.
The wines dataset shows that wine quality as perceived by people is far more complex than the objective chemical measurements recorded in the data set. It is not possible to judge wine quality from these parameters alone, but some features do have an impact on the perceived quality of the wine. In general we tend to prefer wines with a high alcohol concentration, while chlorides, volatile acidity and total sulfur dioxide are negatively associated with perceived quality.